Abstract: This paper investigates vision-based cooperative estimation of the pose of a 3D target object
for visual sensor networks. In our previous work, we presented an estimation mechanism
called the networked visual motion observer, which averages local pose estimates in real time.
This paper extends the mechanism so that it works even when some cameras do not view
the target because of limited view angles and obstructions, in order to take full advantage
of the networked vision system. We then analyze the averaging performance attained by the
proposed mechanism and clarify the relation between the feedback gains in the algorithm and the
performance. Finally, we demonstrate the effectiveness of the algorithm through simulation.

1. INTRODUCTION



Driven by technological innovations in smart wearable vision cameras, networked vision systems consisting of spatially distributed smart cameras are emerging as a new and challenging application field of visual feedback control and estimation (Song et al. (2011); Tron and Vidal (2011)). Such a vision system, called a visual sensor network, offers several potential advantages over a single camera system: (i) accurate estimation by integrating rich information, (ii) tolerance against obstructions, misdetection in image processing and sensor failures, and (iii) a wide field of vision and elimination of blind areas by fusing images of a scene from a variety of viewpoints. Because of these properties, visual sensor networks are expected to become a component of sustainable infrastructures.

Fusion of control techniques and visual information has 
a long history, which is well summarized by Chaumette 
and Hutchinson (2006, 2007); Ma et al. (2004). Among 
a variety of estimation/control problems addressed in the 
literature, this paper investigates a vision-based estimation 
problem of 3D target object motion as in Aguiar and 
Hespanha (2009); Dani et al. (2011); Fujita et al. (2007). 
While most of the above works consider estimation by a 
single or centralized vision system, we consider a coop- 
erative estimation problem for visual sensor networks. In 
particular, we confine our focus to a 3D pose estimation 
problem of a moving target object addressed by Fujita et 
al. (2007), where the authors present a real-time vision- 
based observer called visual motion observer. Namely, we 
investigate cooperative estimation of a target object pose 
via distributed processing. 

Cooperative estimation for sensor networks has been addressed e.g. in Olfati-Saber (2007); Freeman et al. (2006). The main objective of these studies is to average the local measurements or local estimates among sensors in a distributed fashion in order to improve estimation accuracy. For this purpose, most of the works utilize the consensus protocol (Olfati-Saber et al. (2007)) in the update procedure of the local estimates. However, the consensus protocol is not applicable to the full 3D pose estimation problem, as pointed out by Tron and Vidal (2011), since the object's pose takes values in a non-Euclidean space.

Meanwhile, Tron and Vidal (2011); Sarlette and Sepulchre (2009) present distributed averaging algorithms on matrix manifolds. However, applying them to cooperative estimation requires many averaging iterations at each update of the estimate, and hence they cannot deal with the case where the target motion is not slow. To overcome this problem, the authors presented a cooperative estimation mechanism called the networked visual motion observer, which achieves distributed estimation of an object pose in real time (Hatanaka et al. (2011); Hatanaka and Fujita (2012)) by using the pose synchronization techniques in Hatanaka et al. (2012). However, Hatanaka et al. (2011); Hatanaka and Fujita (2012) assume that all the cameras capture the target object, which may spoil advantage (iii) listed in the first paragraph of this section. Though running the algorithm only among the cameras viewing the target and broadcasting the estimates to the other cameras is an option, it is desirable to share an estimate without changing the procedure of each camera, in order to avoid complicated task switches depending on the situation.

In this paper, we thus present a novel estimation mechanism which works in the presence of cameras not capturing the target due to limited view angles and obstructions. Then, we analyze the averaging performance attained by the proposed mechanism and clarify the relation between the tuning gains and the averaging performance. There, we prove that the conclusion of Hatanaka et al. (2011); Hatanaka and Fujita (2012), obtained under the assumption of perfect visibility, is also valid in the case of imperfect visibility. Moreover, we demonstrate the effectiveness of the presented algorithm through simulation.

Fig. 1. Situation under consideration

2. PROBLEM STATEMENT 

2.1 Situation under Consideration 

In this paper, we consider the situation where there are $n$ cameras $\mathcal{V} := \{1, \dots, n\}$ with communication and computation capability and a single target object in 3-dimensional space, as in the left figure of Fig. 1. Let the world frame, the $i$-th camera frame and the object frame be denoted by $\Sigma_w$, $\Sigma_i$ and $\Sigma_o$, respectively. The objective of the networked vision system is to estimate the 3D pose of the object from visual measurements. Although there may be multiple targets in a practical situation, we confine our focus to estimation of a single target, since the case of multiple objects can be handled by applying the procedure for a single object to each object in parallel.

Unlike Hatanaka et al. (2011); Hatanaka and Fujita (2012), each vision camera is assumed to have a limited visible region, and some cameras do not capture the target object, as depicted in the right figure of Fig. 1. Let us now denote the subset of all vision cameras viewing the target at time $t$ by $\mathcal{V}_f(t) \subseteq \mathcal{V}$ and the rest of the cameras by $\mathcal{V}_{\bar f}(t) \subseteq \mathcal{V}$.

Suppose that the pose consistent with the visual measurement of each camera $i \in \mathcal{V}_f(t)$ differs from camera to camera due to incomplete localization and parametric uncertainties of the cameras, as depicted in Fig. 2. Then, the fictitious target with the pose consistent with the $i$-th camera's visual measurement is denoted by $o_i$, $i \in \mathcal{V}_f(t)$, and its frame by $\Sigma_{o_i}$, $i \in \mathcal{V}_f(t)$. Under such a situation, averaging the contaminated poses is a way to improve estimation accuracy (Olfati-Saber (2007); Tron and Vidal (2011)). In this paper, we thus address estimation of an average pose of objects $\{o_i\}_{i \in \mathcal{V}_f}$ in a distributed fashion.

2.2 Relative Rigid Body Motion 

The position vector and the rotation matrix from the $i$-th camera frame $\Sigma_i$ to the world frame $\Sigma_w$ are denoted by $p_{wi} \in \mathbb{R}^3$ and $e^{\hat\xi\theta_{wi}} \in SO(3) := \{R \in \mathbb{R}^{3\times 3} \mid R^T R = R R^T = I_3,\ \det(R) = +1\}$. The vector $\xi_{wi} \in \mathbb{R}^3$ specifies the rotation axis and $\theta_{wi} \in \mathbb{R}$ is the rotation angle. We use $\xi\theta_{wi}$ to denote $\xi_{wi}\theta_{wi}$. The notation '$\wedge$' is the operator such that $\hat a b = a \times b$, $a, b \in \mathbb{R}^3$, for the vector cross product $\times$, i.e. $\hat a$ is a $3 \times 3$ skew-symmetric matrix. The notation '$\vee$' denotes the inverse operator to '$\wedge$'.

The pair of the position $p_{wi}$ and the orientation $e^{\hat\xi\theta_{wi}}$, denoted by $g_{wi} = (p_{wi}, e^{\hat\xi\theta_{wi}}) \in SE(3) := \mathbb{R}^3 \times SO(3)$, is called the pose of camera $i$ relative to the world frame $\Sigma_w$. Similarly, we denote by $g_{wo_i} = (p_{wo_i}, e^{\hat\xi\theta_{wo_i}}) \in SE(3)$ the pose of object $o_i$ relative to the world frame $\Sigma_w$. We also define the body velocity of camera $i$ relative to the world frame $\Sigma_w$ as $V^b_{wi} = (v_{wi}, \omega_{wi}) \in \mathbb{R}^6$, where $v_{wi}$ and $\omega_{wi}$ respectively represent the linear and angular velocities of the origin of $\Sigma_i$ relative to $\Sigma_w$. Similarly, object $o_i$'s body velocity relative to $\Sigma_w$ is denoted by $V^b_{wo_i} = (v_{wo_i}, \omega_{wo_i}) \in \mathbb{R}^6$.

Fig. 2. Sensing under uncertainties

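Throughout this paper, we use the following homogeneous representation of $g = (p, e^{\hat\xi\theta}) \in SE(3)$ and $V^b = (v, \omega)$.

$$g = \begin{bmatrix} e^{\hat\xi\theta} & p \\ 0 & 1 \end{bmatrix}, \qquad \hat V^b = \begin{bmatrix} \hat\omega & v \\ 0 & 0 \end{bmatrix}$$

Then, the body velocities $V^b_{wi}$ and $V^b_{wo_i}$ are simply given by $\hat V^b_{wi} = g_{wi}^{-1}\dot g_{wi}$ and $\hat V^b_{wo_i} = g_{wo_i}^{-1}\dot g_{wo_i}$.

Let $g_{io_i} = (p_{io_i}, e^{\hat\xi\theta_{io_i}}) \in SE(3)$ be the pose of $\Sigma_{o_i}$ relative to $\Sigma_i$. Then, it is known that $g_{io_i}$ can be represented as $g_{io_i} = g_{wi}^{-1} g_{wo_i}$. By using the body velocities $V^b_{wi}$ and $V^b_{wo_i}$, the motion of the relative pose $g_{io_i}$ is written as

$$\dot g_{io_i} = -\hat V^b_{wi}\, g_{io_i} + g_{io_i}\, \hat V^b_{wo_i} \qquad (1)$$

(Ma et al. (2004)). (1) is called relative rigid body motion.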
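
The notation of this subsection can be made concrete with a small numerical sketch. The code below is illustrative only and is not taken from the paper: the helper names (`hat`, `vee`, `exp_so3`, `pose`, `inverse`, `body_velocity_hat`, `relative_motion_rhs`) and all numerical values are assumptions chosen for the example. It builds homogeneous representations, forms the relative pose $g_{io_i} = g_{wi}^{-1} g_{wo_i}$, and evaluates the right-hand side of (1).

```python
import numpy as np

def hat(a):
    """Map a in R^3 to the skew-symmetric matrix with hat(a) @ b == a x b."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def vee(A):
    """Inverse of hat for a 3x3 skew-symmetric matrix."""
    return np.array([A[2, 1], A[0, 2], A[1, 0]])

def exp_so3(xi_theta):
    """Rodrigues' formula for the rotation matrix exp(hat(xi * theta))."""
    xi_theta = np.asarray(xi_theta, dtype=float)
    theta = np.linalg.norm(xi_theta)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(xi_theta / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pose(p, R):
    """4x4 homogeneous representation of g = (p, R)."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

def inverse(g):
    """Inverse of a homogeneous pose without a general matrix inversion."""
    R, p = g[:3, :3], g[:3, 3]
    return pose(-R.T @ p, R.T)

def body_velocity_hat(v, omega):
    """4x4 representation of the body velocity V^b = (v, omega)."""
    V = np.zeros((4, 4))
    V[:3, :3] = hat(np.asarray(omega, dtype=float))
    V[:3, 3] = v
    return V

def relative_motion_rhs(g_io, V_wi_hat, V_wo_hat):
    """Right-hand side of the relative rigid body motion (1)."""
    return -V_wi_hat @ g_io + g_io @ V_wo_hat

# Hypothetical camera and object poses (values chosen only for illustration).
g_wi = pose([1.0, 0.0, 0.0], exp_so3([0.0, 0.0, 0.3]))
g_wo = pose([0.5, 1.0, -2.0], exp_so3([0.3, 0.2, 0.2]))
g_io = inverse(g_wi) @ g_wo  # pose of the object frame relative to camera i
```

Keeping poses as 4x4 matrices means that composition and change of coordinates reduce to matrix products, which is the convention the rest of the paper relies on.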



2.3 Visual Measurement 

In this subsection, we define the visual measurement of each vision camera $i \in \mathcal{V}_f(t)$ which is available for estimation. Unlike Hatanaka et al. (2011); Hatanaka and Fujita (2012), all the cameras in $\mathcal{V}_{\bar f}(t)$ obtain no measurement. Now, we assume (i) all cameras are pinhole-type cameras, (ii) each target object has $m$ ($m \ge 4$) feature points and (iii) each camera can extract them from the vision data. The position vectors of object $o_i$'s $l$-th feature point relative to $\Sigma_{o_i}$ and $\Sigma_i$ are denoted by $p_{o_i l} \in \mathbb{R}^3$ and $p_{il} \in \mathbb{R}^3$, respectively. Using a transformation of the coordinates, we have $p_{il} = g_{io_i} p_{o_i l}$, where $p_{o_i l}$ and $p_{il}$ should be regarded, with a slight abuse of notation, as $[p_{o_i l}^T\ 1]^T$ and $[p_{il}^T\ 1]^T$.

Let the $m$ feature points of object $o_i$ on the image plane coordinate be the measurement $f_i$ of camera $i$, which is given by the perspective projection (Ma et al. (2004)) as

$$f_i := [f_{i1}^T\ \cdots\ f_{im}^T]^T \in \mathbb{R}^{2m}, \qquad f_{il} = \frac{\lambda_i}{z_{il}} \begin{bmatrix} x_{il} \\ y_{il} \end{bmatrix} \qquad (2)$$

with a focal length $\lambda_i$, where $p_{il} = [x_{il}\ y_{il}\ z_{il}]^T$. In this paper, we assume that each camera $i \in \mathcal{V}$ knows the location of feature points $p_{o_i l} \in \mathbb{R}^3$. Then, the visual measurement $f_i$ depends only on the relative pose $g_{io_i}$ from (2) and $p_{il} = g_{io_i} p_{o_i l}$. Fig. 3 shows the block diagram of the relative rigid body motion (1) with the camera model (2), where RRBM is the acronym of Relative Rigid Body Motion.

Fig. 3. Relative rigid body motion with camera model
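
As a minimal sketch of the camera model, the function below maps a relative pose and known feature points to the stacked image measurement by perspective projection. The function name `project`, its argument names and the sample numbers are assumptions made for this illustration; only the pinhole formula itself is the standard one.

```python
import numpy as np

def project(g_io, feature_points_obj, focal_length):
    """Stack the image-plane coordinates of m feature points.

    g_io               : 4x4 homogeneous pose of the object frame in the camera frame
    feature_points_obj : (m, 3) array of feature points in the object frame
    focal_length       : focal length of the pinhole camera
    Returns a vector of length 2m: [x_1/z_1, y_1/z_1, ..., x_m/z_m, y_m/z_m] * focal_length.
    """
    pts = np.asarray(feature_points_obj, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])  # append 1 to each point
    p_cam = (g_io @ homog.T).T[:, :3]                      # points in the camera frame
    x, y, z = p_cam[:, 0], p_cam[:, 1], p_cam[:, 2]
    if np.any(np.isclose(z, 0.0)):
        raise ValueError("a feature point lies on the plane z = 0 of the camera frame")
    f = focal_length * np.stack([x / z, y / z], axis=1)
    return f.reshape(-1)

# Hypothetical example: object frame 2 m in front of the camera, identity rotation.
g_example = np.eye(4)
g_example[:3, 3] = [0.0, 0.0, 2.0]
points_example = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
f_example = project(g_example, points_example, focal_length=0.03)
```

Because the feature point locations in the object frame are assumed known, the only unknown entering `project` is the relative pose, which is what makes the measurement usable for pose estimation.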

2.4 Communication Model

The cameras have communication capability with the neighboring cameras and form a network. The communication is modeled by a graph $G = (\mathcal{V}, \mathcal{E})$, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Namely, camera $i$ can get information from $j$ if $(j, i) \in \mathcal{E}$. We also define the neighbor set $\mathcal{N}_i$ of camera $i \in \mathcal{V}$ as

$$\mathcal{N}_i := \{ j \in \mathcal{V} \mid (j, i) \in \mathcal{E} \}. \qquad (3)$$

In this paper, we employ the following assumption on the graph $G$.

Assumption 1. The communication graph $G$ is fixed, undirected and connected.

We also introduce some additional notations. Let $\mathcal{T}(i_0)$ be the set of all spanning trees over $G$ with a root $i_0 \in \mathcal{V}$ and consider an element $G_T = (\mathcal{V}, \mathcal{E}_T) \in \mathcal{T}(i_0)$. Let the path from $i_0$ to node $i \in \mathcal{V}$ along $G_T$ be denoted by $P_{G_T}(i) = (v_0, \dots, v_{d_{G_T}(i)})$, $v_0 = i_0$, $v_{d_{G_T}(i)} = i$, $(v_l, v_{l+1}) \in \mathcal{E}_T\ \forall l \in \{0, \dots, d_{G_T}(i) - 1\}$, where $d_{G_T}(i)$ is the length of the path $P_{G_T}(i)$. We also define

$$\Delta_{G_T}(E; i) := \begin{cases} 1, & \text{if the path } P_{G_T}(i) \text{ includes edge } E \\ 0, & \text{otherwise} \end{cases}$$

for any $E \in \mathcal{E}_T$ and

$$W := \min_{i_0 \in \mathcal{V}} D(i_0), \quad D(i_0) := \min_{G_T \in \mathcal{T}(i_0)} D(G_T), \quad D(G_T) := \max_{E \in \mathcal{E}_T} \sum_{i \in \mathcal{V}_f} \Delta_{G_T}(E; i)\, d_{G_T}(i). \qquad (4)$$

The meaning of these notations is given in Hatanaka and Fujita (2012).

2.5 Average on SE(3)

The objective of this paper is to present a cooperative estimation mechanism for visual sensor networks which produces, at each camera $i \in \mathcal{V}$, an estimate close to an average of $\{g_{io_j}\}_{j \in \mathcal{V}_f}$, $g_{io_j} := g_{wi}^{-1} g_{wo_j}$, even in the presence of vision cameras not capturing the target.

Let us now introduce the following mean $g^*$ on $SE(3)$ (Moakher (2002)) as an average of target poses $\{g_{wo_j}\}_{j \in \mathcal{V}_f}$.

$$g^* = (p^*, e^{\hat\xi\theta^*}) := \arg\min_{g \in SE(3)} \sum_{j \in \mathcal{V}_f} \psi(g^{-1} g_{wo_j}) \qquad (5)$$

where $\psi$ is defined for any $g = (p, e^{\hat\xi\theta}) \in SE(3)$ as

$$\psi(g) := \frac{1}{2}\|I_4 - g\|_F^2 = \frac{1}{2}\|p\|^2 + \phi(e^{\hat\xi\theta}), \qquad (6)$$

$$\phi(e^{\hat\xi\theta}) := \frac{1}{2}\|I_3 - e^{\hat\xi\theta}\|_F^2 = \mathrm{tr}(I_3 - e^{\hat\xi\theta}) \qquad (7)$$

and $\|M\|_F$ is the Frobenius norm of matrix $M$. Hereafter, we also use the notation $g_i^* = (p_i^*, e^{\hat\xi\theta_i^*}) := g_{wi}^{-1} g^*$.

3. NETWORKED VISUAL MOTION OBSERVER

In this section, we introduce a cooperative estimation mechanism originally presented by Hatanaka et al. (2011). Here, we assume that the relative poses $g_{ij} = g_{wi}^{-1} g_{wj}$ w.r.t. neighbors $j \in \mathcal{N}_i$ are available for each camera $i \in \mathcal{V}$.
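
For intuition on the mean (5) of Section 2.5, note that the cost separates into a position part and an orientation part: since $\psi(g^{-1}g_{wo_j}) = \frac{1}{2}\|p_{wo_j} - p\|^2 + \mathrm{tr}(I_3 - R^T R_j)$ for $g = (p, R)$, the position minimizer is the arithmetic mean and the orientation minimizer maximizes $\mathrm{tr}(R^T \sum_j R_j)$, which can be obtained by projecting $\sum_j R_j$ onto $SO(3)$ with a singular value decomposition. The sketch below is one possible centralized way to compute this mean, not the paper's distributed algorithm; the function names are assumptions, and the sample poses are the target configurations quoted in Section 5.

```python
import numpy as np

def exp_so3(xi_theta):
    """Rodrigues' formula for the rotation matrix exp(hat(xi * theta))."""
    w = np.asarray(xi_theta, dtype=float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    a = w / theta
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def euclidean_mean_se3(positions, rotations):
    """Centralized computation of the mean (5) for given poses.

    positions : (N, 3) array of positions
    rotations : sequence of N rotation matrices
    """
    p_star = np.mean(np.asarray(positions, dtype=float), axis=0)
    M = np.sum(np.asarray(rotations, dtype=float), axis=0)
    U, _, Vt = np.linalg.svd(M)
    # Force det = +1 so that the projection lands in SO(3), not just O(3).
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R_star = U @ D @ Vt
    return p_star, R_star

# Target configurations quoted in Section 5 (cameras 1, 2 and 3 view the target).
positions = np.array([[0.55, 1.00, -1.91],
                      [0.30, 0.80, -1.84],
                      [0.56, 1.05, -2.00]])
rotations = [exp_so3([0.30, 0.19, 0.21]),
             exp_so3([0.21, 0.30, 0.19]),
             exp_so3([0.29, 0.20, 0.31])]
p_star, R_star = euclidean_mean_se3(positions, rotations)
```

A closed form of this kind needs all poses at one place, which is exactly what a distributed mechanism has to avoid; it is useful here mainly as a reference value against which distributed estimates can be compared.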



3.1 Review of Previous Works 

We first prepare a model of the rigid body motion (1) as

$$\dot{\bar g}_i = -\hat V^b_{wi}\, \bar g_i + \bar g_i\, \hat u_{ei}, \qquad (8)$$

where $\bar g_i = (\bar p_i, e^{\hat\xi\bar\theta_i})$ is the estimate of the average $g_i^* = g_{wi}^{-1} g^*$. The input $u_{ei} = (v_{uei}, \omega_{uei})$ is to be designed so that $\bar g_i$ approaches $g_i^*$. Once $\bar g_i$ is determined, the estimated visual measurement $\bar f_i$ is computed by (2).

Let us now define the error $g_{ei} := \bar g_i^{-1} g_{io_i}$ between the estimate $\bar g_i$ and the relative pose $g_{io_i}$ and its vector representation $e_{ei} := E_R(g_{ei})$ with

$$E_R(g) := \begin{bmatrix} p \\ e_R(e^{\hat\xi\theta}) \end{bmatrix}, \quad e_R(e^{\hat\xi\theta}) := \mathrm{sk}(e^{\hat\xi\theta})^{\vee}, \quad \mathrm{sk}(e^{\hat\xi\theta}) := \frac{1}{2}\left(e^{\hat\xi\theta} - e^{-\hat\xi\theta}\right). \qquad (9)$$

It is shown by Fujita et al. (2007) that if the number of feature points $m$ is greater than or equal to 4, the estimation error vector $e_{ei}$ can be approximately reconstructed from the visual measurement error $f_{ei} := f_i - \bar f_i$ as

$$e_{ei} = J_i^{\dagger}(\bar g_i)\, f_{ei}. \qquad (10)$$

In the case of a single camera, Fujita et al. (2007) present an input $u_{ei} = k_e e_{ei}$ based on passivity of the estimation error system from $u_{ei}$ to $-e_{ei}$, and the resulting estimation mechanism (8), (10) and $u_{ei} = k_e e_{ei}$ is called the visual motion observer. The authors then prove that the estimate $\bar g_i$ converges to the actual relative pose $g_{io_i}$ if $V^b_{wo_i} = 0$.

Hatanaka et al. (2011) extended the results in Fujita et al. (2007) to networked vision systems, where the following input to the model (8) was proposed.

$$u_{ei} = k_e e_{ei} + k_s \sum_{j \in \mathcal{N}_i} E_R(\bar g_i^{-1} \bar g_{i,j}), \quad k_e > 0,\ k_s > 0 \qquad (11)$$

with $\bar g_{i,j} := g_{ij}\bar g_j$. The input consists of both the visual feedback term $k_e e_{ei}$ and the mutual feedback term $k_s \sum_{j \in \mathcal{N}_i} E_R(\bar g_i^{-1}\bar g_{i,j})$ inspired by pose synchronization in Hatanaka et al. (2012). The resulting networked estimation mechanism (8), (10) and (11) is named the networked visual motion observer. That paper then analyzed the averaging performance attained by the proposed mechanism.

Fig. 4. Networked visual motion observer
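
The structure of the input, a visual feedback term plus a sum of mutual feedback terms over neighbors, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the error vector `e_ei` is assumed to be already reconstructed from image data, and the optional weight `delta_i` is an extra argument of this sketch that scales the visual feedback term (1.0 reproduces the input of Hatanaka et al. (2011); 0.0 switches the visual term off for a camera without a measurement).

```python
import numpy as np

def vee(A):
    """Vector of a 3x3 skew-symmetric matrix."""
    return np.array([A[2, 1], A[0, 2], A[1, 0]])

def e_R(R):
    """Orientation error vector: vee of the skew-symmetric part of R."""
    return vee(0.5 * (R - R.T))

def E_R(g):
    """Error vector of a homogeneous pose: position stacked on e_R(rotation)."""
    return np.concatenate([g[:3, 3], e_R(g[:3, :3])])

def inverse(g):
    """Inverse of a 4x4 homogeneous pose."""
    R, p = g[:3, :3], g[:3, 3]
    g_inv = np.eye(4)
    g_inv[:3, :3] = R.T
    g_inv[:3, 3] = -R.T @ p
    return g_inv

def networked_input(g_bar_i, e_ei, neighbor_estimates_in_i, k_e, k_s, delta_i=1.0):
    """Visual feedback plus mutual feedback for camera i.

    g_bar_i                 : 4x4 estimate held by camera i
    e_ei                    : 6-vector visual estimation error (assumed available)
    neighbor_estimates_in_i : list of 4x4 neighbor estimates expressed in frame i
    delta_i                 : weight of the visual term (assumption of this sketch)
    """
    u = delta_i * k_e * np.asarray(e_ei, dtype=float)
    g_inv = inverse(g_bar_i)
    for g_bar_ij in neighbor_estimates_in_i:
        u = u + k_s * E_R(g_inv @ g_bar_ij)
    return u

# Hypothetical usage: one neighbor whose estimate is shifted by 0.1 along x.
g_self = np.eye(4)
g_neighbor = np.eye(4)
g_neighbor[:3, 3] = [0.1, 0.0, 0.0]
u_example = networked_input(g_self, np.zeros(6), [g_neighbor], k_e=1.0, k_s=50.0)
```

Each camera only needs its own estimate, its own measurement error and its neighbors' estimates, so the computation is local in the sense required by the communication model.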



3.2 Networked Visual Motion Observer under Imperfect Visibility

In the presence of cameras not capturing the target, camera $i \in \mathcal{V}_{\bar f}(t)$ cannot implement the visual feedback term $k_e e_{ei}$ in (11). We thus employ the following input instead of (11).

$$u_{ei} = \delta_i(t)\, k_e e_{ei} + k_s \sum_{j \in \mathcal{N}_i} E_R(\bar g_i^{-1} \bar g_{i,j}), \qquad (12)$$

where $\delta_i(t) = 1$ if $i \in \mathcal{V}_f(t)$ and $\delta_i(t) = 0$ otherwise. The total estimation mechanism is formulated as (8), (10) and the inputs (12), whose block diagram with respect to camera $i$ is illustrated in Fig. 4.

The input (12) for $i \in \mathcal{V}_{\bar f}(t)$ is the gradient descent algorithm on $SE(3)$ of the local objective function $\sum_{j \in \mathcal{N}_i} \psi(\bar g_i^{-1}\bar g_{i,j})$ (Absil et al. (2008)), which means each camera in $\mathcal{V}_{\bar f}(t)$ aims at leading its estimate $\bar g_i$ to its neighbors' estimates $\{\bar g_{i,j}\}_{j \in \mathcal{N}_i}$. On the other hand, the input for $i \in \mathcal{V}_f(t)$ aims at leading $\bar g_i$ to both the object pose $g_{io_i}$ and the neighbors' estimates. Meanwhile, the global objective is given by (5), which differs from the local objective functions. Thus, the closeness between the estimates and the global objective minimizer $g^*$ is not clear.

In the next section, we thus clarify the averaging performance. Although it is conjectured from its structure and demonstrated through simulation (http://www.fl.ctrl.titech.ac.jp/researches/movie_new/sim/sw_coopest.wmv) that the present mechanism works for a moving object, we will derive a theoretical result under the assumption that the target object is static ($V^b_{wo_i} = 0$). The main reason to use this assumption is to assure time invariance of $\mathcal{V}_f(t)$. Indeed, in the case of a time-varying $\mathcal{V}_f(t)$, the global objective itself changes in time and it is necessary to find a metric evaluating the performance in order to conduct theoretical analysis, which is left as future work.

4. AVERAGING PERFORMANCE

In this section, we derive the ultimate estimation accuracy of the average $g^*$ achieved by the presented mechanism, assuming that the object is static ($V^b_{wo_i} = 0\ \forall i \in \mathcal{V}_f$). Throughout this section, we use the following assumption.

Assumption 2.

(i) The number of elements of $\mathcal{V}_f$ is greater than or equal to 2 ($|\mathcal{V}_f| \ge 2$) and there exists a pair $(i, j) \in \mathcal{V}_f \times \mathcal{V}_f$ such that $p_{wo_i} \ne p_{wo_j}$ and $e^{\hat\xi\theta_{wo_i}} \ne e^{\hat\xi\theta_{wo_j}}$.

(ii) $e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}} > 0$ for all $i \in \mathcal{V}_f$.

Item (i) is assumed just to avoid the meaningless problem in which all the poses in $\{g_{wo_i}\}_{i \in \mathcal{V}_f}$ are equal, for which it is straightforward to prove convergence of the estimates to the common pose by using the techniques presented by Hatanaka et al. (2012). A detailed discussion on the validity of assumption (ii) is given in Hatanaka and Fujita (2012), but it is in general satisfied in the scenario described at the beginning of Section 2.



4.1 Definition of Approximate Averaging

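In this subsection, we introduce a notion of approximate averaging similarly to Hatanaka and Fujita (2012). For this purpose, we define parameters

$$\rho_p := \frac{1}{2}\sum_{i \in \mathcal{V}_f} \|p^* - p_{wo_i}\|^2, \qquad \rho_R := \sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}})$$

and the following sets for any positive parameter $\varepsilon$

$$\Omega_p(\varepsilon) := \Big\{ (\bar p_i)_{i \in \mathcal{V}} \ \Big|\ \frac{1}{|\mathcal{V}|}\sum_{i \in \mathcal{V}} \frac{1}{2}\|\bar p_i - p_i^*\|^2 \le \varepsilon \frac{\rho_p}{|\mathcal{V}_f|} \Big\},$$

$$\Omega_R(\varepsilon) := \Big\{ (e^{\hat\xi\bar\theta_i})_{i \in \mathcal{V}} \ \Big|\ \frac{1}{|\mathcal{V}|}\sum_{i \in \mathcal{V}} \phi(e^{-\hat\xi\theta_i^*} e^{\hat\xi\bar\theta_i}) \le \varepsilon \frac{\rho_R}{|\mathcal{V}_f|} \Big\}.$$

Let us define the $\varepsilon$-level averaging performance to be met by the estimates $\bar g_i = (\bar p_i, e^{\hat\xi\bar\theta_i})$.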

Definition 3. Given target poses $(g_{io_i})_{i \in \mathcal{V}_f}$ and $\varepsilon > 0$, the position estimates $(\bar p_i)_{i \in \mathcal{V}}$ and orientation estimates $(e^{\hat\xi\bar\theta_i})_{i \in \mathcal{V}}$ are respectively said to achieve $\varepsilon$-level averaging performance, if there exists a finite $T$ such that

$$(\bar p_i(t))_{i \in \mathcal{V}} \in \Omega_p(\varepsilon) \ \text{ and } \ (e^{\hat\xi\bar\theta_i(t)})_{i \in \mathcal{V}} \in \Omega_R(\varepsilon) \quad \forall t \ge T.$$

In the case of $\mathcal{V}_f = \mathcal{V}$, $\rho_p$ and $\rho_R$ indicate the average estimation accuracy in the absence of the mutual feedback term of $u_{ei}$ in (12), since the visual motion observer correctly estimates the static object pose $g_{io_i}$. In that case, the parameter $\varepsilon$ is an indicator of the improvement of average estimation accuracy obtained by inserting the mutual feedback term.

4.2 Averaging Performance Analysis 

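In this subsection, we state the main result of this paper. For this purpose, we first define a value

$$\phi_m := \max_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}})$$

and a parameter $\zeta > 0$ strictly greater than $\phi_m$. Then, we have the following lemma.

Lemma 4. Suppose that the targets are static ($V^b_{wo_i} = 0\ \forall i \in \mathcal{V}_f$) and the estimates $(\bar g_i)_{i \in \mathcal{V}}$ are updated according to (8) and (12). Then, under Assumptions 1 and 2 and $e^{-\hat\xi\theta_i^*} e^{\hat\xi\bar\theta_i(t)} > 0\ \forall t \ge 0$, there exists a finite $\tau$ such that $\phi(e^{-\hat\xi\theta_i^*} e^{\hat\xi\bar\theta_i(t)}) \le \zeta\ \forall t \ge \tau,\ i \in \mathcal{V}$.

Proof. See Appendix A.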

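The proof of Lemma 4 means that the set

$$S := \{ (e^{\hat\xi\bar\theta_i})_{i \in \mathcal{V}} \mid e^{-\hat\xi\theta_i^*} e^{\hat\xi\bar\theta_i} > 0\ \forall i \in \mathcal{V} \}$$

is positively invariant for (8) with (12).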

We are now ready to state the main result of this section.

Theorem 5. Suppose the targets are static ($V^b_{wo_i} = 0\ \forall i \in \mathcal{V}_f$) and the estimates $(\bar g_i)_{i \in \mathcal{V}}$ are updated according to (8) and (12). Then, under Assumptions 1 and 2 and $\beta := 1 - \sqrt{2\zeta} > 0$, if the initial estimates satisfy $(e^{\hat\xi\bar\theta_i(0)})_{i \in \mathcal{V}} \in S$, for any $\varepsilon \in (0, 1)$, there exists a sufficiently small $k = k_e/k_s$ such that the position estimates $(\bar p_i)_{i \in \mathcal{V}}$ achieve $\varepsilon$-level averaging performance and the orientation estimates $(e^{\hat\xi\bar\theta_i})_{i \in \mathcal{V}}$ achieve $\varepsilon_R$-level averaging performance with $\varepsilon_R = 1 - (1 - \varepsilon)\beta$.

Proof. See Appendix B.

Theorem 5 says that choosing the gains $k_e$ and $k_s$ such that $k = k_e/k_s$ is sufficiently small leads to a good averaging performance. The conclusion is the same as in Hatanaka et al. (2011); Hatanaka and Fujita (2012), and hence the contribution of the theorem is to prove that the statement remains valid in the presence of cameras not viewing the target. We also see an essential difference between the position and orientation estimates: the averaging performance on positions can be arbitrarily improved by choosing a sufficiently small $k$, but an offset associated with $\beta < 1$ occurs for the orientation estimates.

The energy function $U_R$ in (B.1), which allows us to prove Theorem 5, is defined by the sum of the individual errors between the average and the estimates. The selection of this function is inspired by Chopra and Spong (2006).
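
To get a feel for the size of the orientation offset in Theorem 5, the quantities $\phi_m$, $\beta = 1 - \sqrt{2\zeta}$ and $\varepsilon_R = 1 - (1 - \varepsilon)\beta$ can be evaluated numerically. The sketch below does so for the target orientations quoted in Section 5. It is illustrative only: the choices $\zeta = 1.1\,\phi_m$ and $\varepsilon = 0.1$ are assumptions made for this example (the theorem only requires $\zeta$ strictly greater than $\phi_m$ and $\varepsilon \in (0, 1)$), and the orientation average is computed centrally by projecting the sum of rotation matrices onto $SO(3)$.

```python
import numpy as np

def exp_so3(xi_theta):
    """Rodrigues' formula for the rotation matrix exp(hat(xi * theta))."""
    w = np.asarray(xi_theta, dtype=float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    a = w / theta
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def phi(R):
    """Orientation error function (7): tr(I_3 - R)."""
    return float(np.trace(np.eye(3) - R))

# Target orientations seen by cameras 1, 2, 3 (values quoted in Section 5).
R_targets = [exp_so3([0.30, 0.19, 0.21]),
             exp_so3([0.21, 0.30, 0.19]),
             exp_so3([0.29, 0.20, 0.31])]

# Orientation part of the mean (5): projection of the sum onto SO(3).
U, _, Vt = np.linalg.svd(np.sum(R_targets, axis=0))
R_star = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

phi_m = max(phi(R_star.T @ R) for R in R_targets)
zeta = 1.1 * phi_m                  # assumption: any value strictly above phi_m
beta = 1.0 - np.sqrt(2.0 * zeta)    # must be positive for Theorem 5 to apply
eps = 0.1                           # assumption: requested position-level accuracy
eps_R = 1.0 - (1.0 - eps) * beta    # guaranteed orientation-level accuracy
```

Since $\beta < 1$, the resulting `eps_R` is always larger than `eps`, which is the offset discussed above; the closer the target orientations are to each other, the smaller $\phi_m$ and the closer $\beta$ is to 1.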

5. VERIFICATION THROUGH SIMULATION 

We finally demonstrate the effectiveness of the present algorithm through simulation. Here, we consider five pinhole-type cameras with focal length 0.03 m connected by the communication graph with $\mathcal{E} = \{(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)\}$. We identify the frame of camera 1 with the world frame and let $p_{w2} = [1\ 0\ 0]^T$, $p_{w3} = [0\ 1\ 0]^T$, $p_{w4} = [-1\ 0\ 0]^T$, $p_{w5} = [0\ -1\ 0]^T$ and $e^{\hat\xi\theta_{wi}} = I_3\ \forall i \in \{2, 3, 4, 5\}$. Let only cameras $i = 1, 2, 3$ (gray boxes in Fig. 5) capture the target, i.e. $\mathcal{V}_f = \{1, 2, 3\}$.

We set the configurations of target objects as $p_{wo_1} = [0.55\ 1.00\ -1.91]^T$, $p_{wo_2} = [0.30\ 0.80\ -1.84]^T$, $p_{wo_3} = [0.56\ 1.05\ -2.00]^T$, $\xi\theta_{wo_1} = [0.30\ 0.19\ 0.21]^T$, $\xi\theta_{wo_2} = [0.21\ 0.30\ 0.19]^T$, $\xi\theta_{wo_3} = [0.29\ 0.20\ 0.31]^T$. The red boxes in Fig. 5 represent the initial configuration of target objects and the yellow boxes represent the cameras $\mathcal{V}_{\bar f} = \{4, 5\}$. Then, the average $g^* = (p^*, e^{\hat\xi\theta^*})$ is given by $p^* = [0.47\ 0.95\ -1.92]^T$, $\xi\theta^* = [0.27\ 0.23\ 0.24]^T$.
We run simulations with two different gains $k_e = 1$, $k_s = 1$ ($k = 1$) and $k_e = 1$, $k_s = 50$ ($k = 0.02$) from the initial condition $\bar p_i(0) = [0\ 0\ 1]^T$ and $e^{\hat\xi\bar\theta_i(0)} = I_3\ \forall i$. Fig. 6 shows the time responses of the position estimation error energy

$$\frac{1}{2}\sum_{i \in \mathcal{V}} \|\bar p_i - p_i^*\|^2$$

and the orientation estimation error energy $U_R$ defined in (B.1), where the red solid curves illustrate the result for $k_s = 50$ and the blue dashed curves that for $k_s = 1$. We see from both figures that the energies for the larger mutual feedback gain $k_s = 50$ are smaller than those for $k_s = 1$, which implies that a large $k_s$, and hence a small $k$, achieves a good averaging performance, as indicated by Theorem 5. Fig. 7 illustrates the time responses of the first element of the orientation estimates $\xi\sin\bar\theta_{w,i}$ ($e^{\hat\xi\bar\theta_{w,i}} = e^{\hat\xi\theta_{wi}} e^{\hat\xi\bar\theta_i}$) of all cameras produced by the networked visual motion observer, where the red dash-dotted line represents the average. We also see from the figure that, while the estimate of camera 2 for $k_s = 1$ is far from the average, all the estimates for $k_s = 50$ approach it. However, we also confirm that an offset still occurs even in the case of $k_s = 50$, as indicated in Theorem 5.

Fig. 6. Time responses of estimation error energies for $k_s = 1$ and $k_s = 50$
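
The qualitative trend reported above can be reproduced with a simplified sketch. The code below is not the authors' simulator: it assumes static cameras and targets, assumes the error vector $e_{ei}$ is available exactly instead of being reconstructed from image data through (10), works directly with world-frame estimates, and uses a forward Euler scheme whose step size and horizon (`dt`, `T`) are arbitrary choices. Under these assumptions the world-frame estimates evolve as $\dot{\bar p}_{w,i} = \delta_i k_e (p_{wo_i} - \bar p_{w,i}) + k_s \sum_{j \in \mathcal{N}_i} (\bar p_{w,j} - \bar p_{w,i})$ for positions and as (A.2) with (A.3) for orientations. Graph, visibility set, target poses, gains and initial estimates follow the values quoted in this section.

```python
import numpy as np

def hat(a):
    return np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])

def exp_so3(w):
    """Rodrigues' formula for exp(hat(w))."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def e_R(R):
    """vee of the skew-symmetric part of R."""
    S = 0.5 * (R - R.T)
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def phi(R):
    return float(np.trace(np.eye(3) - R))

EDGES = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
NEIGHBORS = {i: [] for i in range(1, 6)}
for a, b in EDGES:                      # undirected communication graph
    NEIGHBORS[a].append(b)
    NEIGHBORS[b].append(a)

P_CAM = {1: [0, 0, 0], 2: [1, 0, 0], 3: [0, 1, 0], 4: [-1, 0, 0], 5: [0, -1, 0]}
P_O = {1: np.array([0.55, 1.00, -1.91]),    # targets seen by cameras 1, 2, 3
       2: np.array([0.30, 0.80, -1.84]),
       3: np.array([0.56, 1.05, -2.00])}
R_O = {1: exp_so3([0.30, 0.19, 0.21]),
       2: exp_so3([0.21, 0.30, 0.19]),
       3: exp_so3([0.29, 0.20, 0.31])}

# Reference average (5), computed centrally for evaluation only.
p_star = np.mean(list(P_O.values()), axis=0)
U, _, Vt = np.linalg.svd(np.sum(list(R_O.values()), axis=0))
R_star = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

def simulate(k_e, k_s, T=12.0, dt=2e-3):
    """Euler integration of the world-frame estimates; returns final energies."""
    p = {i: np.array(P_CAM[i], dtype=float) + np.array([0.0, 0.0, 1.0]) for i in NEIGHBORS}
    R = {i: np.eye(3) for i in NEIGHBORS}
    for _ in range(int(T / dt)):
        v, w = {}, {}
        for i in NEIGHBORS:
            v[i] = sum(k_s * (p[j] - p[i]) for j in NEIGHBORS[i])
            w[i] = sum(k_s * e_R(R[i].T @ R[j]) for j in NEIGHBORS[i])
            if i in P_O:                # visual feedback only if the target is visible
                v[i] = v[i] + k_e * (P_O[i] - p[i])
                w[i] = w[i] + k_e * e_R(R[i].T @ R_O[i])
        for i in NEIGHBORS:
            p[i] = p[i] + dt * v[i]
            R[i] = R[i] @ exp_so3(dt * w[i])
    U_p = 0.5 * sum(float(np.sum((p[i] - p_star) ** 2)) for i in p)
    U_R = sum(phi(R_star.T @ R[i]) for i in R)
    return U_p, U_R

U_p_1, U_R_1 = simulate(k_e=1.0, k_s=1.0)      # k = 1
U_p_50, U_R_50 = simulate(k_e=1.0, k_s=50.0)   # k = 0.02
```

Comparing `U_p_1` with `U_p_50` and `U_R_1` with `U_R_50` gives a quick check of the role of $k = k_e/k_s$; cameras 4 and 5 never use a measurement in this sketch and are driven only by their neighbors' estimates.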

6. CONCLUSION 

In this paper, we have investigated a vision-based cooperative estimation problem of a 3D target object pose for visual sensor networks. In particular, we have extended the networked visual motion observer presented by Hatanaka et al. (2011) so that it works even in the presence of cameras not viewing the target due to limited view angles and obstructions. Then, we have analyzed the averaging performance attained by the present mechanism. Finally, we have demonstrated the effectiveness of the present algorithm through simulation.
Appendix A. PROOF OF LEMMA 4

In the proof, we use the following lemma.

Lemma 6. (Hatanaka et al. (2012)). For any matrices $R_1, R_2, R_3 \in SO(3)$, the inequality

$$\frac{1}{2}\mathrm{tr}\left(R_1^T R_2 - R_1^T R_3 R_2^T R_3\right) \ge \phi(R_1^T R_3) - \phi(R_1^T R_2) + \lambda_{\min}(\mathrm{sym}(R_1^T R_3))\, \phi(R_3^T R_2) \qquad (A.1)$$

holds, where $\mathrm{sym}(M) := \frac{1}{2}(M + M^T)$ and $\lambda_{\min}(M)$ is the minimal eigenvalue of matrix $M$.

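Extracting the time evolution of the orientation estimates from (8) with (12) and transforming their coordinates from $\Sigma_i$ to $\Sigma_w$ as $e^{\hat\xi\bar\theta_{w,i}} = e^{\hat\xi\theta_{wi}} e^{\hat\xi\bar\theta_i}$ yields

$$\frac{d}{dt} e^{\hat\xi\bar\theta_{w,i}} = e^{\hat\xi\bar\theta_{w,i}}\, \hat\omega_{uei}, \qquad (A.2)$$

$$\omega_{uei} = \delta_i k_e\, e_R(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) + k_s \sum_{j \in \mathcal{N}_i} e_R(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}), \qquad (A.3)$$

which is independent of the evolution of the position estimates. Let us now consider the energy function

$$U := \phi(e^{-\hat\xi\theta_l^*} e^{\hat\xi\bar\theta_l}) = \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}}), \qquad (A.4)$$

similarly to Lemma 1 in Hatanaka and Fujita (2012), where $l(t) := \arg\max_{i \in \mathcal{V}} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}(t))$. The time derivative of $U$ along the trajectories of (A.2) with (A.3) is given as

$$\dot U = 2 e_R^T(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}})\, \omega_{uel} = -\mathrm{tr}\left(\mathrm{sk}(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}})\, \hat\omega_{uel}\right), \qquad (A.5)$$

where we use the relation $a^T b = -\frac{1}{2}\mathrm{tr}(\hat a \hat b)$. Substituting (A.3) into (A.5) yields

$$\dot U = -\frac{1}{2}\mathrm{tr}\Big\{ \delta_{l(t)} k_e \left(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_l}} - e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}} e^{-\hat\xi\theta_{wo_l}} e^{\hat\xi\bar\theta_{w,l}}\right) + k_s \sum_{j \in \mathcal{N}_l} \left(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}} - e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}} e^{-\hat\xi\bar\theta_{w,j}} e^{\hat\xi\bar\theta_{w,l}}\right) \Big\}. \qquad (A.6)$$

From Lemma 6, (A.6) is rewritten as

$$\dot U \le -\delta_{l(t)} k_e A_l - k_s B_l,$$

where

$$A_i := \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}}) + \sigma_i\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}), \qquad \sigma_i := \lambda_{\min}(\mathrm{sym}(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}})),$$

$$B_i := \sum_{j \in \mathcal{N}_i} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) + \sigma_i\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \right).$$

The inequality $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}}) \ge \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}})\ \forall j \in \mathcal{V}$ holds from the definition of the index $l$, and hence we obtain $B_l \ge \sum_{j \in \mathcal{N}_l} \sigma_l\, \phi(e^{-\hat\xi\bar\theta_{w,l}} e^{\hat\xi\bar\theta_{w,j}})$. Thus, the inequality

$$\dot U \le -\delta_{l(t)} k_e A_l - k_s \sigma_l \sum_{j \in \mathcal{N}_l} \phi(e^{-\hat\xi\bar\theta_{w,l}} e^{\hat\xi\bar\theta_{w,j}})$$

is true. From Assumption 2, we have $\sigma_l > 0$ and hence

$$\dot U \le -\delta_{l(t)} k_e \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_l}}) \right). \qquad (A.7)$$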

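Suppose now that $l(t) \in \mathcal{V}_f$. Then, if $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,l}}) > \zeta$, $\dot U < 0$ is true from the definition of $\zeta$. On the other hand, in the case of $l(t) \in \mathcal{V}_{\bar f}$, we also have $\dot U \le 0$. Namely, the function $U$ never increases as long as an estimate $e^{\hat\xi\bar\theta_{w,i}}$ satisfies $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) > \zeta$. This implies that once the estimates $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}$ enter

$$S_\zeta := \{ (e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \mid \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \le \zeta\ \forall i \in \mathcal{V} \}$$

at a time, $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}$ stays in the set for all subsequent time.

Let us now employ another energy function

$$V := \sum_{i \in \Lambda(t)} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \zeta \right), \qquad (A.8)$$

$$\Lambda(t) := \{ i \in \mathcal{V} \mid \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \ge \zeta \}. \qquad (A.9)$$

The function $V \ge 0$ is continuous but it may not be differentiable on the region where an estimate $e^{\hat\xi\bar\theta_{w,i}}$ satisfies $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) = \zeta$. Except for the region, the time derivative of $V$ along the trajectories of (A.2) is given by

$$\dot V \le -k_e \sum_{i \in \Lambda(t) \cap \mathcal{V}_f} A_i - k_s \sum_{i \in \Lambda(t)} B_i. \qquad (A.10)$$

We first consider

$$\sum_{i \in \Lambda(t)} \sum_{j \in \mathcal{N}_i} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) \right) \qquad (A.11)$$

in the second term of (A.10). In the case of $j \notin \Lambda(t)$, we have $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) > 0$. Otherwise ($j \in \Lambda(t)$), the term $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}})$ has to appear in (A.11) under Assumption 1 and they are canceled. Thus, the inequality

$$\sum_{i \in \Lambda(t)} \sum_{j \in \mathcal{N}_i} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) \right) \ge 0$$

holds and hence we obtain

$$\sum_{i \in \Lambda(t)} B_i \ge \sum_{i \in \Lambda(t)} \sigma_i \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}). \qquad (A.12)$$

Note that the equality of (A.12) can hold only if $\Lambda(t) = \mathcal{V}$ or $\Lambda(t) = \emptyset$ since, otherwise, there must be a pair $(i, j) \in \mathcal{E}$ such that $j \notin \Lambda(t)$ and $i \in \Lambda(t)$ from Assumption 1.

Now, substituting (A.12) into (A.10) yields

$$\dot V \le -k_e \sum_{i \in \Lambda(t) \cap \mathcal{V}_f} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}}) + \sigma_i\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \right) - k_s \sum_{i \in \Lambda(t)} \sigma_i \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}).$$

We see from the inequality and the definition of $\zeta$ that if there exists $i \in \Lambda(t) \cap \mathcal{V}_f$ then $\dot V < 0$. In addition, if $\Lambda(t) \cap \mathcal{V}_f = \emptyset$, then the inequality (A.12) strictly holds and hence $\dot V < 0$. Namely, the function $V$ is strictly decreasing except for the region where an estimate $e^{\hat\xi\bar\theta_{w,i}}$ satisfies

$$\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) = \zeta. \qquad (A.13)$$

Since the function $V$ is continuous despite the event that an estimate goes across the region, the function $V$ decreases and the estimates $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}$ enter $S_\zeta$ at least once as long as the time interval

$$T_b := \{ t \ge 0 \mid \exists i \in \mathcal{V} \text{ satisfying (A.13)} \}$$

is bounded and the number of occurrences of the event (A.13) is finite. As proved above, if all the estimates enter $S_\zeta$ once, they have to stay there for all subsequent time.

Let us now define $V'$ and $\Lambda'(t)$ by just replacing $\zeta$ by $\zeta' = \phi_m + (\zeta - \phi_m)/2$. Then, all the above discussions hold true. Notice that every time an estimate goes across the region of (A.13), the estimate has to spend nonzero finite time in the region where $\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \in (\zeta', \zeta)$ from continuity of $e^{\hat\xi\bar\theta_{w,i}}$ and $\|\omega_{uei}\| < \infty$. During the period, the function $V'$ is strictly decreasing. Namely, if the event happens infinitely often, $V' \to -\infty$, which contradicts $V' \ge 0$. The possibility that $T_b$ is unbounded is also excluded in the same way. This completes the proof.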



Appendix B. PROOF OF THEOREM 5 

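In this paper, we prove only the orientation part since it is possible to prove the position part in the same way. The evolution of orientation estimates $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}$ is described by (A.2) with (A.3).

We first define the energy function

$$U_R := \sum_{i \in \mathcal{V}} \phi(e^{-\hat\xi\theta_i^*} e^{\hat\xi\bar\theta_i}) = \sum_{i \in \mathcal{V}} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \qquad (B.1)$$

and the sets

$$S_1(\varepsilon) := \Big\{ (e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \in S \ \Big|\ \sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \ge \varepsilon\rho_R \Big\},$$

$$S_2 := \Big\{ (e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \in S \ \Big|\ \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \ge \frac{k\rho_R}{\beta} \Big\}.$$

Then, we first prove the following lemma.

Lemma 7. Suppose that all the assumptions of Theorem 5 hold. Then, there exists a sufficiently small $k$ such that the time derivative of $U_R$ along the trajectories of (A.2) and (A.3) satisfies $\dot U_R < 0$ at least after the time $\tau$ in the region where $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \in (S_1(\varepsilon'_R) \cup S_2)$ with $\varepsilon'_R := 1 - (1 - \varepsilon)\beta\left(1 - \sqrt{kW/\beta}\right)^2$.

Proof. The time derivative of $U_R$ along the trajectories of (A.2) and (A.3) is given by

$$\dot U_R = 2\sum_{i \in \mathcal{V}} e_R^T(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}})\, \omega_{uei} = -\sum_{i \in \mathcal{V}} \mathrm{tr}\left(\mathrm{sk}(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}})\, \hat\omega_{uei}\right). \qquad (B.2)$$

Substituting (A.3) into (B.2) yields

$$\dot U_R = -k_e \sum_{i \in \mathcal{V}_f} \mathrm{tr}(\Phi_1) - k_s \sum_{i \in \mathcal{V}} \mathrm{tr}(\Phi_2), \qquad (B.3)$$

$$\Phi_1 := \frac{1}{2}\left( e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}} - e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}} e^{-\hat\xi\theta_{wo_i}} e^{\hat\xi\bar\theta_{w,i}} \right), \qquad \Phi_2 := \frac{1}{2}\sum_{j \in \mathcal{N}_i} \left( e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}} - e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}} e^{-\hat\xi\bar\theta_{w,j}} e^{\hat\xi\bar\theta_{w,i}} \right).$$

We first consider the term $\sum_{i \in \mathcal{V}} \mathrm{tr}(\Phi_2)$. From Lemma 6, the following inequality holds.

$$\sum_{i \in \mathcal{V}} \mathrm{tr}(\Phi_2) \ge \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \left\{ \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) + \sigma_i\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \right\}. \qquad (B.4)$$

Assumption 1 implies that

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \left( \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,j}}) \right) = 0$$

(Hatanaka et al. (2012)). From Lemma 4, (B.4) is rewritten as

$$\sum_{i \in \mathcal{V}} \mathrm{tr}(\Phi_2) \ge \beta \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \qquad (B.5)$$

at least after time $\tau$ similarly to Hatanaka and Fujita (2012).

We next consider the term $\sum_{i \in \mathcal{V}_f} \mathrm{tr}(\Phi_1)$ in (B.3). In the same way as Hatanaka and Fujita (2012), we can prove

$$\sum_{i \in \mathcal{V}_f} \mathrm{tr}(\Phi_1) \ge \sum_{i \in \mathcal{V}_f} \left\{ \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\theta_{wo_i}}) + \beta\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \right\} = -\rho_R + \sum_{i \in \mathcal{V}_f} \left\{ \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) + \beta\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \right\} \qquad (B.6)$$

at least after time $\tau$. Substituting (B.6) and (B.5) into (B.3) yields

$$\dot U_R \le -k_e \sum_{i \in \mathcal{V}_f} \left\{ \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) + \beta\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \right\} + k_e\rho_R - k_s\beta \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}). \qquad (B.7)$$

From (B.7), $\beta > 0$ and the definitions of the sets $S_1(\varepsilon)$ and $S_2$, we have $\dot U_R < 0$ in the region $S_1(1) \cup S_2$. Namely, the remaining task is to prove $\dot U_R < 0$ in the region of $S_1(\varepsilon'_R) \setminus S_2$.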

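Equation (B.7) is also rewritten as

$$\dot U_R \le k_e\rho_R + \sum_{i \in \mathcal{V}_f} k_e \left( -\phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) - \beta(1 - \varepsilon)\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \right) - a_R, \qquad (B.8)$$

$$a_R := \beta \Big( \sum_{i \in \mathcal{V}_f} k_e \varepsilon\, \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) + k_s \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \Big), \qquad (B.9)$$

where $a_R$ is strictly positive under Assumption 2. Now, for any $a \in (0, 1)$ and $j^* \in \mathcal{V}$, we have

$$\phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\theta_{wo_i}}) \ge a\, \phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\theta_{wo_i}}) - \frac{a}{1 - a}\, \phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\bar\theta_{w,i}}). \qquad (B.10)$$

Let $j^*$ be a node satisfying $j^* = \arg\min_{i_0} D(i_0)$ and $G_T^* = (\mathcal{V}, \mathcal{E}_T^*) \in \mathcal{T}(j^*)$ be the graph satisfying $G_T^* = \arg\min_{G_T \in \mathcal{T}(j^*)} D(G_T)$. Then, we obtain

$$\phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\bar\theta_{w,i}}) \le d_{G_T^*}(i) \sum_{l \in \{0, \dots, d_{G_T^*}(i) - 1\}} \phi(e^{-\hat\xi\bar\theta_{w,v_l(i)}} e^{\hat\xi\bar\theta_{w,v_{l+1}(i)}}),$$

where $(v_0(i), \dots, v_{d_{G_T^*}(i)}(i))$ is the path from root $j^*$ to node $i$ along the tree $G_T^*$. Namely,

$$\sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\bar\theta_{w,i}}) \le \sum_{i \in \mathcal{V}_f} d_{G_T^*}(i) \sum_{l \in \{0, \dots, d_{G_T^*}(i) - 1\}} \phi(e^{-\hat\xi\bar\theta_{w,v_l(i)}} e^{\hat\xi\bar\theta_{w,v_{l+1}(i)}})$$

holds. For any edge $E = (v^1, v^2)$ of $G_T^*$, the coefficient of $\phi(e^{-\hat\xi\bar\theta_{w,v^1}} e^{\hat\xi\bar\theta_{w,v^2}})$ in the right-hand side of the above inequality is given by $\sum_{i \in \mathcal{V}_f} \Delta_{G_T^*}(E; i)\, d_{G_T^*}(i)$, which is upper-bounded by $D(G_T^*) = W$. We thus have

$$\sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\bar\theta_{w,i}}) \le W \sum_{E = (v^1, v^2) \in \mathcal{E}_T^*} \phi(e^{-\hat\xi\bar\theta_{w,v^1}} e^{\hat\xi\bar\theta_{w,v^2}}) \le W \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}). \qquad (B.11)$$

The latter inequality of (B.11) holds because $G_T^*$ is a subgraph of $G$.

Suppose that $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \in S_1(\varepsilon'_R) \setminus S_2$. Then, the inclusion $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \notin S_2$ holds and hence

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) < \frac{k\rho_R}{\beta}. \qquad (B.12)$$

From the definition of the average, we also have

$$\sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\bar\theta_{w,j^*}} e^{\hat\xi\theta_{wo_i}}) \ge \rho_R. \qquad (B.13)$$

Substituting (B.10), (B.11), (B.12) and (B.13) into (B.8) yields

$$\dot U_R \le -k_e \sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) + k_e\rho_R \Big( 1 - (1 - \varepsilon)\Big( a\beta - \frac{ka}{1 - a} W \Big) \Big) - a_R. \qquad (B.14)$$

Since (B.14) holds for all $a \in (0, 1)$, if we set $a = 1 - \sqrt{kW/\beta}$ for a sufficiently small $k$ satisfying $1 - \sqrt{kW/\beta} \in (0, 1)$, we obtain

$$\dot U_R \le -k_e \sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) + k_e \varepsilon'_R \rho_R - a_R.$$

Moreover, because of $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}} \in S_1(\varepsilon'_R)$,

$$\dot U_R \le -a_R < 0$$

holds true. This completes the proof. □

We are now ready to prove Theorem 5. We immediately see from Lemma 7 that the trajectories of orientation estimates $(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}$ along (A.3) settle into the set $\Omega_R(\varepsilon)$ with $\varepsilon$ satisfying

$$S_3(\varepsilon'_R) := S \setminus (S_1(\varepsilon'_R) \cup S_2) \subseteq \Omega_R(\varepsilon). \qquad (B.15)$$

Let us next derive an upper bound of the minimal $\varepsilon$ satisfying (B.15). For this purpose, we consider

$$\max_{(e^{\hat\xi\bar\theta_{w,i}})_{i \in \mathcal{V}}} \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \qquad (B.16)$$

$$\text{subject to } \sum_{i \in \mathcal{V}_f} \phi(e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}) \le \varepsilon'_R \rho_R, \qquad (B.17)$$

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{N}_i} \phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \le \frac{k\rho_R}{\beta}. \qquad (B.18)$$

To compute an upper bound of the optimal value, we relax the constraints (B.18) as

$$\phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \le \frac{k\rho_R}{\beta} \quad \forall i \in \mathcal{V},\ j \in \mathcal{N}_i. \qquad (B.19)$$

Any two nodes are connected by a path over graph $G$ whose length is smaller than the diameter of the graph $G$ denoted by $\mathrm{diam}(G)$. Thus, (B.19) implies that

$$\phi(e^{-\hat\xi\bar\theta_{w,i}} e^{\hat\xi\bar\theta_{w,j}}) \le \alpha_R := \frac{k\rho_R\, \mathrm{diam}(G)^2}{\beta} \quad \forall i, j \in \mathcal{V}. \qquad (B.20)$$

Note that $\lim_{k \to 0} \alpha_R = 0$. If we define $R_i := I_3 - e^{-\hat\xi\theta^*} e^{\hat\xi\bar\theta_{w,i}}$, the problem (B.16), (B.17), (B.20) is rewritten as

$$\max \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \frac{1}{2}\|R_i\|_F^2 \qquad (B.21)$$

$$\text{subject to } \frac{1}{2}\sum_{i \in \mathcal{V}_f} \|R_i\|_F^2 \le \varepsilon'_R \rho_R, \qquad (B.22)$$

$$\frac{1}{2}\|R_i - R_j\|_F^2 \le \alpha_R \quad \forall i, j \in \mathcal{V}. \qquad (B.23)$$

For any $i \in \mathcal{V}$ and $j \in \mathcal{V}_f$, (B.23) implies that

$$\|R_i\|_F \le \|R_i - R_j\|_F + \|R_j\|_F \le \sqrt{2\alpha_R} + \|R_j\|_F.$$

Now, it is clear that the optimal solution to (B.21) has to satisfy $\|R_i\|_F \ne 0\ \forall i \in \mathcal{V}$ and there exists a sufficiently small $k$ such that $\|R_i\|_F - \sqrt{2\alpha_R} > 0$. We thus obtain

$$|\mathcal{V}_f| \left( \|R_i\|_F - \sqrt{2\alpha_R} \right)^2 \le \sum_{j \in \mathcal{V}_f} \|R_j\|_F^2 \le 2\varepsilon'_R \rho_R$$

and hence

$$\|R_i\|_F \le \sqrt{2}\left( \sqrt{\alpha_R} + \sqrt{\varepsilon'_R \rho_R / |\mathcal{V}_f|} \right).$$

Namely, the optimal value of (B.21) is upper bounded by $\left(\sqrt{\alpha_R} + \sqrt{\varepsilon'_R \rho_R / |\mathcal{V}_f|}\right)^2$ and hence $\varepsilon$ is also bounded by $\left(\sqrt{\alpha_R |\mathcal{V}_f| / \rho_R} + \sqrt{\varepsilon'_R}\right)^2$. Since $\lim_{k \to 0} \alpha_R = 0$ and $\lim_{k \to 0} \varepsilon'_R = \varepsilon_R$,

$$\lim_{k \to 0} \left( \sqrt{\alpha_R |\mathcal{V}_f| / \rho_R} + \sqrt{\varepsilon'_R} \right)^2 = \varepsilon_R$$

holds. This completes the proof.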